1 Abstract

Air pollution is one of the major concern in today’s world. It can have very serious cost, penalites and consequences for the health of human beings and also ruthlessly distresses natural bio-network and ecosystems.Out of five major air pollutants (round-level ozone, particle pollution also known as particulate matter, carbon monoxide, sulfur dioxide, and nitrogen dioxide) WHO has identified SPM (Suspended Particulated Matter) as most sinister in terms of its effects on health. You may experience health issues such as Irregular heartbeat,Respiratory symptoms like coughing, wheezing or difficulty breathing etc. Therefore measuring these pollutants on daily basis and forecast is vital for any government. Daily measuring and forecasts can help Governments in designing policies to control air pollution and issue a guidance to citizens of possible health impacts and precautions to be taken if air quality becomes bad or predicted to be bad on next day.

2 Intro to Dataset

With this little background, for academic purposes we have decided to use Air Quality dataset from UCI machine learning respository for time series term project. This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017.

This hourly data set considers 6 main air pollutants and 6 relevant meteorological variables at multiple sites in Beijing.

SrNo Air Pollutants Description
1 PM2.5 PM2.5 concentration (ug/\(m^{3}\))
2 PM10 PM10 concentration (ug/\(m^{3}\))
3 SO2 Sulphur Dioxide concentration (ug/\(m^{3}\))
4 NO2 Nitrogen Dioxide concentration (ug/\(m^{3}\))
5 CO Carbon Monoxide concentration (ug/\(m^{3}\))
6 O3 Ozone concentration (ug/\(m^{3}\))
7 TEMP temperature (degree Celsius)
8 PRES pressure (hPa)
9 DEWP dew point temperature (degree Celsius)
10 RAIN precipitation (mm)
11 wd wind direction
12 WSPM wind speed (m/s)

Goal
Goal of this project is to develop a model to forecast on all major air pollutants PM2.5,PM10,SO2,NO2,CO and O3 at least for next 10 days and effect of these pollutants on temperature and rain (Multivariate Analysis)

This dataset contains observations from 12 Chinese cities. For initial analysis (EDA) we are using observations from the city called Dongsi and it will be extended to two more cities for the purpose of the project and will try fit best possible model based on analysis.

Following are city names :

SrNo City
1 Aoizhongxin
2 Changping
3 Dingling
4 Dongsi
5 Guanyuan
6 Gucheng
7 Huairou
8 Nongzhanguan
9 Shunyi
10 Titantan
11 Wanliu
12 Wanshouxigong

Let’s use data from the city called Dongsi.

3 Analysis of Dataset

3.1 Structure

## 'data.frame':    35064 obs. of  18 variables:
##  $ No     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year   : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ day    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ PM2.5  : num  9 4 7 3 3 4 5 3 3 3 ...
##  $ PM10   : num  9 4 7 3 3 4 5 6 6 6 ...
##  $ SO2    : num  3 3 NA 5 7 9 10 12 12 9 ...
##  $ NO2    : num  17 16 17 18 NA 25 29 40 41 31 ...
##  $ CO     : int  300 300 300 NA 200 300 400 400 500 400 ...
##  $ O3     : num  89 88 60 NA 84 78 67 52 54 69 ...
##  $ TEMP   : num  -0.5 -0.7 -1.2 -1.4 -1.9 -2.4 -2.5 -1.4 -0.3 0.4 ...
##  $ PRES   : num  1024 1025 1025 1026 1027 ...
##  $ DEWP   : num  -21.4 -22.1 -24.6 -25.5 -24.5 -21.3 -20.4 -20.4 -21.2 -23.3 ...
##  $ RAIN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wd     : Factor w/ 16 levels "E","ENE","ESE",..: 7 8 7 4 7 8 8 7 8 4 ...
##  $ WSPM   : num  5.7 3.9 5.3 4.9 3.2 2.4 2.2 3 4.6 5.5 ...
##  $ station: Factor w/ 1 level "Dongsi": 1 1 1 1 1 1 1 1 1 1 ...

Combine Day Month and Year fields into single Date field.

3.2 Missing Values

3.3 Imputations

For imputation we will be using R package imputeTS . It has different methods depending on seasonality and trends to impute values for time series.

So let’s first plot the each series by excluding rows with NA values to check if there is trend or seasonality in the data and apply these methods [1] accordingly.

3.3.1 PM2.5 Series

Next Observartion Carried Backward Method

3.3.2 PM10 Series

Next Observartion Carried Backward Method

3.3.3 SO2 Series

Seasonal Adjustment then LOCF

## Warning: na.seadec will replaced by na_seadec.
##     Functionality stays the same.
##     The new function name better fits modern R code style guidelines.
##     Please adjust your code accordingly.
## Warning in na_seadec(x, algorithm, find_frequency, maxgap, ...): No seasonality information for dataset could be found, going on without decomposition.
##               Setting find_frequency=TRUE might be an option.

3.3.4 NO2 Series

Next Observartion Carried Backward Method

3.3.5 O3 Series

Seasonal Adjustment then LOCF

## Warning: na.seadec will replaced by na_seadec.
##     Functionality stays the same.
##     The new function name better fits modern R code style guidelines.
##     Please adjust your code accordingly.
## Warning in na_seadec(x, algorithm, find_frequency, maxgap, ...): No seasonality information for dataset could be found, going on without decomposition.
##               Setting find_frequency=TRUE might be an option.

3.3.6 CO Series

Seasonal Adjustment then LOCF

## Warning: na.seadec will replaced by na_seadec.
##     Functionality stays the same.
##     The new function name better fits modern R code style guidelines.
##     Please adjust your code accordingly.
## Warning in na_seadec(x, algorithm, find_frequency, maxgap, ...): No seasonality information for dataset could be found, going on without decomposition.
##               Setting find_frequency=TRUE might be an option.

3.3.8 Final Dataset

Final dataset for this time series analysis is obtained by converting hourly observations to averaged daily observations. This data will then be used to 10 day forecast.

4 PM2.5 Analysis

Following analysis is done by using tswge R package that we have used exstensively this course.

4.1 Stationarity

First plot the data.

  • No visible trend and seasonality in realization
  • ACF cuts off after 2nd lag. No indication of non-stationarity
  • Spectral density plot shows peak at 0, indicating wandering behavior
  • First and second half of ACFs shows no significant difference.
  • This is stationary series.

4.2 Trend/Signal check

## 
## Call:
## lm(formula = PM25 ~ t, data = ts_daily_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83.44 -52.83 -19.90  28.30 481.39 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 87.0021727  3.8605346  22.536   <2e-16 ***
## t           -0.0009465  0.0045744  -0.207    0.836    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.74 on 1459 degrees of freedom
## Multiple R-squared:  2.934e-05,  Adjusted R-squared:  -0.000656 
## F-statistic: 0.04281 on 1 and 1459 DF,  p-value: 0.8361

p-value > 0.05 (using OLS method) indicates b1 = 0 indicates no signal present in the data

4.3 Model ID

AR(1,1) turns out to be the best mode as per AIC function.

4.4 Estimate Parameters

4.4.1 Estimation

## 
## Coefficients of Original polynomial:  
## 1.5740 -0.7845 0.2155 -0.0550 0.0284 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9633B              1.0381               0.9633       0.0000
## 1-0.8349B+0.2773B^2    1.5056+-1.1575i      0.5266       0.1043
## 1+0.2242B+0.1062B^2   -1.0558+-2.8813i      0.3259       0.3059
##   
## 
## [1]  1.57395167 -0.78450207  0.21553552 -0.05495932  0.02836324
## [1] 0.933258
## [1] 86.31029

4.4.2 Residual diagnositcs

## Obs -0.0002931233 -0.001050439 -0.002165653 -0.0002722587 0.006506776 -0.01155279 -0.0435071 -0.005600042 0.01912308 0.01538431 0.01577499 -0.02114577 0.00706101 0.01408816 0.027602 0.004925593 -0.01359664 -0.01152449 -0.01264407 -0.02226326 0.02887934 0.02193741 -0.02797961 0.03644576
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 13.07265
## 
## $df
## [1] 18
## 
## $pval
## [1] 0.7872393

4.5 Final Model

Final PM2.5 Model:
(1 - 1.58B + 0.79\(B^2\) - 0.22\(B^3\) + 0.06\(B^4\) - 0.03\(B^5\)) (\(X_t\) - 85.96) = (1 - 0.93B) (\(a_t\))

4.6 PM2.5 Forecasting

5 PM10 Analysis

5.1 Stationarity

First plot the data.

  • No visible trend and seasonality in realization
  • ACF cuts off after 2nd lag. No indication of non-stationarity
  • Spectral density plot shows peak at 0, indicating wandering behavior
  • First and second half of ACFs shows no significant difference.
  • This is stationary series.

5.2 Trend/Signal check

## 
## Call:
## lm(formula = PM10 ~ t, data = ts_daily_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105.68  -58.99  -19.56   32.07  473.75 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.102e+02  4.208e+00   26.19   <2e-16 ***
## t           5.000e-04  4.986e-03    0.10     0.92    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.38 on 1459 degrees of freedom
## Multiple R-squared:  6.894e-06,  Adjusted R-squared:  -0.0006785 
## F-statistic: 0.01006 on 1 and 1459 DF,  p-value: 0.9201

p-value > 0.05 (using OLS method) indicates b1 = 0 indicates no signal present in the data

5.3 Model ID

AR(6,1) turns out to be the best mode as per AIC function.

5.4 Estimate Parameters

5.4.1 Estimation

## 
## Coefficients of Original polynomial:  
## 1.5341 -0.7206 0.1976 -0.0402 -0.0373 0.0481 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9717B              1.0291               0.9717       0.0000
## 1-1.0950B+0.4195B^2    1.3052+-0.8248i      0.6477       0.0897
## 1+0.1225B+0.2875B^2   -0.2130+-1.8528i      0.5362       0.2682
## 1+0.4101B             -2.4383               0.4101       0.5000
##   
## 
## [1]  1.53414843 -0.72055103  0.19758558 -0.04024130 -0.03726040  0.04806275
## [1] 0.9325623
## [1] 110.5843

5.4.2 Residual diagnositcs

## Obs -0.00104693 -0.002345418 0.0006287611 -0.001048871 0.005649315 0.004902281 -0.03849331 -0.002147024 0.005807087 0.04055073 -0.01048374 -0.01530926 0.00174568 0.01310702 0.01360484 0.002005844 -0.01765535 -0.046519 -0.01001625 -0.02519184 0.007243046 0.03297038 -0.01377204 0.04518259
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 15.5644
## 
## $df
## [1] 17
## 
## $pval
## [1] 0.5548968

5.5 Final Model

Final PM10 Model:

(1 - 1.53B + 0.72\(B^2\) - 0.18\(B^3\) + 0.04\(B^4\) + 0.03\(B^5\) - 0.05\(B^5\) ) (\(X_t\) - 110.111) = (1 - 0.93B) (\(a_t\))

5.6 PM10 Forecasting

## 
## Coefficients of Original polynomial:  
## 1.5341 -0.7206 0.1976 -0.0402 -0.0373 0.0481 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9717B              1.0291               0.9717       0.0000
## 1-1.0950B+0.4195B^2    1.3052+-0.8248i      0.6477       0.0897
## 1+0.1225B+0.2875B^2   -0.2130+-1.8528i      0.5362       0.2682
## 1+0.4101B             -2.4383               0.4101       0.5000
##   
## 
## 
## Coefficients of Original polynomial:  
## 0.9326 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9326B              1.0723               0.9326       0.0000
##   
## 

6 SO2 Analysis

6.1 Stationarity

First plot the data.

  • Visual evidence of trend in some part of realization.
  • Realization show periodic behavior but that is not reflected in spectral density.
  • Peak at 0 and 0.13. But that’s not a strong evidence.
  • ACFs are not damping but has very mild evidence of sinusoidal behavior
  • Mild visual evidence that first half and second half of ACFs are different.
  • Time series not a stationary time series.

6.2 Trend/Signal check

## 
## Call:
## lm(formula = SO2 ~ t, data = ts_daily_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.129 -12.442  -5.715   6.298 115.679 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.244311   0.994978   28.39   <2e-16 ***
## t           -0.013293   0.001179  -11.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.01 on 1459 degrees of freedom
## Multiple R-squared:  0.08015,    Adjusted R-squared:  0.07952 
## F-statistic: 127.1 on 1 and 1459 DF,  p-value: < 2.2e-16
## Call:
## lm(formula = SO2 ~ t, data = ts_daily_df)
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 28.3959630  2.0767594  13.673 < 2.2e-16 ***
## t           -0.0134533  0.0024557  -5.478 5.052e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.8334 on 1458 degrees of freedom
## Multiple R-squared:  0.0202 ,  Adjusted R-squared:  0.0195
## F-statistic: 30 on 1 and 1458 DF,  p-value: < 5.052e-08
## 
## Durbin-Watson statistic 
## (original):    0.74940 , p-value: 5.398e-127
## (transformed): 2.01091 , p-value: 5.723e-01

p-value < 0.05 (using Cochrane-Orcutt Test) provides strong evidence that trend exist.

6.3 Model ID

AR(9,1) turns out to be the best mode as per AIC function.

6.4 Estimate Parameters

6.4.1 Estimation

## 
## Coefficients of Original polynomial:  
## 1.4029 -0.5464 0.0795 -0.0164 0.1073 -0.0848 -0.0244 0.2167 -0.1419 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9908B              1.0093               0.9908       0.0000
## 1-1.2552B+0.6852B^2    0.9160+-0.7877i      0.8277       0.1130
## 1-0.2310B+0.6437B^2    0.1794+-1.2335i      0.8023       0.2270
## 1+1.0585B+0.6216B^2   -0.8514+-0.9401i      0.7884       0.3671
## 1+0.7307B             -1.3685               0.7307       0.5000
## 1-0.7152B              1.3983               0.7152       0.0000
##   
## 

## 
## Coefficients of Original polynomial:  
## 0.4225 -0.1330 -0.0516 -0.0683 0.0401 -0.0463 -0.0702 0.1453 0.0051 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.2593B+0.6901B^2    0.9123+-0.7853i      0.8307       0.1131
## 1-0.2364B+0.6462B^2    0.1829+-1.2304i      0.8039       0.2265
## 1+1.0516B+0.6179B^2   -0.8510+-0.9456i      0.7860       0.3666
## 1-0.7387B              1.3536               0.7387       0.0000
## 1+0.7255B             -1.3783               0.7255       0.5000
## 1+0.0348B             -28.7668               0.0348       0.5000
##   
## 
## [1]  0.422529679 -0.133032504 -0.051579252 -0.068347519  0.040095263
## [6] -0.046341580 -0.070162449  0.145306647  0.005133966
## [1] 0.907354
## Obs -0.001752766 0.005348165 -0.00141413 0.0006021668 -0.003601681 0.00736844 -0.003218477 0.0006892955 0.01216199 -0.01767099 -0.01022156 -0.02613282 0.05670951 -0.02612948 0.05192423 -0.04315283 -0.04373766 0.01366933 -0.04351761 -0.004347992 -0.006651504 0.01523455 0.03650025 -0.002965029
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 22.83122
## 
## $df
## [1] 14
## 
## $pval
## [1] 0.06307211

6.4.2 Residual diagnositcs

## Obs -0.003239577 0.002701459 0.001243839 -0.005176032 -0.008703925 0.003541291 -0.004860781 -0.002254335 0.01461634 -0.02273508 -0.01016355 -0.02621313 0.05318427 -0.02956708 0.05061178 -0.04635682 -0.0420046 0.01542838 -0.04253254 -0.0039102 -0.005211346 0.0171068 0.03978618 -0.002457134
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 23.41837
## 
## $df
## [1] 20
## 
## $pval
## [1] 0.2687217

6.5 Final Model

Final SO2 Model:

(1 - 1.40B + 0.558\(B^2\) - 0.088\(B^3\) + 0.028\(B^4\) - 0.118\(B^5\) + 0.088\(B^6\) + 0.028\(B^7\) - 0.218\(B^8\) + 0.148\(B^2\))(\(X_t\) - 18.53) = (1 - 0.89B) (\(a_t\))

6.6 SO2 Forecasting

## ASE

## [1] 178.5033
## 
## Coefficients of Original polynomial:  
## 0.4225 -0.1330 -0.0516 -0.0683 0.0401 -0.0463 -0.0702 0.1453 0.0051 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.2593B+0.6901B^2    0.9123+-0.7853i      0.8307       0.1131
## 1-0.2364B+0.6462B^2    0.1829+-1.2304i      0.8039       0.2265
## 1+1.0516B+0.6179B^2   -0.8510+-0.9456i      0.7860       0.3666
## 1-0.7387B              1.3536               0.7387       0.0000
## 1+0.7255B             -1.3783               0.7255       0.5000
## 1+0.0348B             -28.7668               0.0348       0.5000
##   
## 
## 
## Coefficients of Original polynomial:  
## 0.9074 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9074B              1.1021               0.9074       0.0000
##   
## 

7 Conclusion

  • Forecast from ARMA model are not correct forecast
  • AR(1) is the strongest behavior in the above forecast. Doesn’t really follow the trend and Seasonality.
  • Multivariate analysis is required on this dataset
  • PM2.5 and PM10 forecasts may be well explained if CO,SO2 and O3 are taken into account.

8 What Next ?

Final part of the project will include following

  • Multivariate Analysis (include all pollutants)
  • Effect of pollutants on Rain and Temperature
  • Develop models with Neural Network.

9 References

  • Applied Time Series with R, Dr.Wayne Woodword,Henry L. Gray,Alan C. Elliott
  • An article on ENVIRONMENTAL EFFECTS OF AIR POLLUTION AND APPLICATION OF ENGINEERED METHODS TO COMBAT THE PROBLEM, Journal of Industrial Pollution Control.
  • Middle Tennessee Air Quality Forecast , Nashville Health Department.
  • An article on Major Air pollutants.
  • An article on common Air Pollutants and its effects on human beings.
  • World Air Quality Forecast
  • Kaggle article on handling Time series missing values. [1]